Analyzing Resource Utilization in an HPC System: A Case Study of NERSC Perlmutter
Resource demands of HPC applications vary significantly. However, it is
common for HPC systems to primarily assign resources on a per-node basis to
prevent interference from co-located workloads. This gap between coarse-grained
resource allocation and varying resource demands can leave HPC resources
underutilized. In this study, we analyze the resource
usage and application behavior of NERSC's Perlmutter, a state-of-the-art
open-science HPC system with both CPU-only and GPU-accelerated nodes. Our
one-month usage analysis reveals that CPUs are commonly not fully utilized,
especially for GPU-enabled jobs. Also, around 64% of both CPU and GPU-enabled
jobs used 50% or less of the available host memory capacity. Additionally,
about 50% of GPU-enabled jobs used at most 25% of the GPU memory, and memory
capacity went underutilized to some degree across all jobs. While our study
comes early in Perlmutter's lifetime, so policies and application workloads may
change, it provides valuable insights into performance characterization and
application behavior, and it motivates systems with more fine-grained resource
allocation.
PaST-NoC: A Packet-Switched Superconducting Temporal NoC
Temporal computing promises to mitigate the stringent area constraints and
clock distribution overheads of traditional superconducting digital computing.
To design a scalable, area- and power-efficient superconducting network on chip
(NoC), we propose packet-switched superconducting temporal NoC (PaST-NoC).
PaST-NoC operates its control path in the temporal domain using race logic
(RL), combined with bufferless deflection flow control to minimize area.
Packets encode their destination using RL and carry a collection of data pulses
that the receiver can interpret as pulse trains, RL, serialized binary, or
other formats. We demonstrate how to scale up PaST-NoC to arbitrary topologies
based on 2x2 routers and 4x4 butterflies as building blocks. As we show, if
data pulses are interpreted using RL, PaST-NoC outperforms state-of-the-art
superconducting binary NoCs in throughput per area by as much as 5x for long
packets.

Comment: 14 pages, 18 figures, 2 tables. In press in IEEE Transactions on Applied Superconductivity
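The race-logic control path described in this abstract lends itself to a small illustration. In race logic, a value is encoded as the arrival time of a pulse, so MIN, MAX, and constant addition map onto first-arrival (an OR gate), last-arrival (an AND gate), and a fixed delay element. The sketch below models that encoding in software; the function names are illustrative, not from the paper.

```python
# Hedged sketch of race logic (RL), the temporal encoding PaST-NoC uses
# for its control path: a value is represented by the arrival time of a
# pulse, so basic operations become timing primitives.

def rl_min(*arrivals):
    # First-arriving pulse wins: an OR gate implements MIN in race logic.
    return min(arrivals)

def rl_max(*arrivals):
    # Last-arriving pulse: an AND gate implements MAX.
    return max(arrivals)

def rl_add_const(arrival, delay):
    # A fixed delay element adds a constant to a temporally coded value.
    return arrival + delay

# A router comparing two temporally encoded fields can pick the earlier
# pulse without ever converting to binary:
a, b = 3, 7            # values encoded as arrival times (cycle numbers)
winner = rl_min(a, b)  # -> 3
```

This is why RL-based control can be so area-efficient: the "computation" is a single gate observing pulse order, with no binary comparator in the path.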
Efficient Intra-Rack Resource Disaggregation for HPC Using Co-Packaged DWDM Photonics
The diversity of workload requirements and increasing hardware heterogeneity
in emerging high performance computing (HPC) systems motivate resource
disaggregation. Resource disaggregation allows compute and memory resources to
be allocated individually as required to each workload. However, it is unclear
how to efficiently realize this capability and cost-effectively meet the
stringent bandwidth and latency requirements of HPC applications. To that end,
we describe how modern photonics can be co-designed with modern HPC racks to
implement flexible intra-rack resource disaggregation and fully meet the bit
error rate (BER) and high escape bandwidth requirements of all chip types in modern HPC
racks. Our photonic-based disaggregated rack provides an average application
speedup of 11% (46% maximum) for 25 CPU and 61% for 24 GPU benchmarks compared
to a similar system that instead uses modern electronic switches for
disaggregation. Using observed resource usage from a production system, we
estimate that an iso-performance intra-rack disaggregated HPC system using
photonics would require 4x fewer memory modules and 2x fewer NICs than a
non-disaggregated baseline.

Comment: 15 pages, 12 figures, 4 tables. Published in IEEE Cluster 202
Understanding Quantum Control Processor Capabilities and Limitations through Circuit Characterization
Continuing the scaling of quantum computers hinges on building classical
control hardware pipelines that are scalable, extensible, and provide real-time
response. The instruction set architecture (ISA) of the control processor
provides functional abstractions that map high-level semantics of quantum
programming languages to low-level pulse generation by hardware. In this paper,
we provide a methodology to quantitatively assess the effectiveness of the ISA
to encode quantum circuits for intermediate-scale quantum devices. The
characterization model that we define reflects performance, the ability to meet
timing constraints, scalability to future quantum chips, and other important
considerations, making it a useful guide for future designs. Using our
methodology, we propose scalar (QUASAR)
and vector (qV) quantum ISAs as extensions and compare them with other ISAs in
metrics such as circuit encoding efficiency, the ability to meet real-time gate
cycle requirements of quantum chips, and the ability to scale to more qubits.

Comment: 10 pages, 8 figures
Pre-Configured Routes
In multi-core ASICs, processors and other compute engines need to communicate with memory blocks and other cores with latency as close as possible to the ideal of a direct buffered wire. However, current state-of-the-art networks-on-chip (NoCs) suffer, at best, a latency of one clock cycle per hop. We investigate the design of a NoC that offers close-to-ideal latency on some preferred, run-time-configurable paths. Processors and other compute engines may perform network reconfiguration to guarantee low latency over different sets of paths as needed. Flits on non-preferred paths are given lower priority than flits on preferred paths to enable the latter to achieve low latency. To reach our goal, we extend the "mad-postman" technique [1]: every incoming flit is eagerly (i.e., speculatively) forwarded to the input's preferred output, if any. This incurs only the delay of a single pre-enabled tri-state driver. We later check whether that decision was correct, and if not, we forward the flit to the proper output. Incorrectly forwarded flits are classified as dead and are eliminated at later hops.
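The speculate-then-check flow of eager forwarding can be sketched as a toy model. The port names, routing table, and helper function below are hypothetical, invented only to illustrate the mechanism, not taken from the paper's hardware design.

```python
# Minimal sketch of the extended "mad-postman" eager forwarding: each
# incoming flit is speculatively sent to its input port's preferred
# output, then checked; a mis-forwarded copy becomes a dead flit (killed
# at a later hop) and the flit is re-sent on the correct output.

def route(flit_dest, in_port, preferred_out, routing_table):
    """Return (eager_out, corrected_out, dead_copy_emitted)."""
    eager_out = preferred_out[in_port]      # speculative, near-wire latency
    correct_out = routing_table[flit_dest]  # computed slightly later
    if eager_out == correct_out:
        return eager_out, None, False       # speculation succeeded
    # Speculation failed: the eager copy is now dead; resend properly.
    return eager_out, correct_out, True

preferred = {"west": "east"}                # a pre-configured low-latency path
table = {"mem0": "east", "core3": "north"}
route("mem0", "west", preferred, table)     # hit: no dead flit
route("core3", "west", preferred, table)    # miss: dead copy on "east"
```

The design choice mirrors the abstract: a correct speculation costs only the pre-enabled driver delay, while a wrong one costs a wasted (dead) flit plus a normal-latency retransmission.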
Variable-Width Datapath for On-Chip Network Static Power Reduction
With the tight power budgets in modern large-scale chips and the unpredictability of application traffic, on-chip network designers face a dilemma: design for worst-case bandwidth demands and incur high static power overheads, or design for an average traffic pattern and risk degrading performance. This paper proposes adaptive bandwidth networks (ABNs), which divide channels and switches into lanes so that the network provides just the bandwidth necessary at each hop. ABNs also activate input virtual channels (VCs) individually and take advantage of drowsy SRAM cells to eliminate false VC activations. In addition, ABNs readily apply to silicon defect tolerance with only the extra cost of fault detection. For application traffic, ABNs reduce total power consumption by an average of 45% with comparable performance compared to single-lane power-gated networks, and by 33% compared to multi-network designs.
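As a rough software analogy for the lane-division idea, a controller could enable only as many lanes as the current offered load requires and leave the rest power-gated. The parameters and the sizing rule below are invented for illustration and are not the paper's actual policy.

```python
# Hedged sketch of adaptive-bandwidth lane activation: a channel is split
# into lanes, and only enough lanes to carry the offered load are powered
# on; idle lanes stay gated to cut static power.

import math

def active_lanes(offered_load, lane_bw, total_lanes):
    """Number of lanes to enable for the current offered load.

    offered_load / lane_bw are in the same units (e.g. flits/cycle).
    """
    needed = math.ceil(offered_load / lane_bw)
    return max(1, min(needed, total_lanes))  # keep >= 1 lane for liveness

active_lanes(0.05, 0.25, 4)  # light traffic: a single lane suffices
active_lanes(2.00, 0.25, 4)  # heavy traffic: clamp at the channel width
```

The clamp at one lane reflects the need to keep the network connected even under near-zero load; the upper clamp is simply the physical channel width.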